Abstract

Formalizing linguists' intuitions of language change as a dynamical system, we quantify the time course 
of language change including sudden vs. gradual changes in languages. We apply the computer model to 
the historical loss of Verb Second from Old French to modern French, showing that otherwise adequate 
grammatical theories can fail our new evolutionary criterion. 
1 Introduction



Language scientists have long been occupied with de- 
scribing phonological, syntactic, and semantic change, 
often appealing to an analogy between language change 
and evolution, but rarely going beyond this. For instance,
Lightfoot (1991, chapter 7, pp. 163-65ff.) talks
about language change in this way: "Some general properties
of language change are shared by other dynamic
systems in the natural world."¹ Here we formalize these
intuitions, to the best of our knowledge for the first time,
as a concrete, computational, dynamical systems model,
investigating its consequences. Specifically, we show
that a computational population language change model
emerges as a natural consequence of individual language
learnability. Our computational model establishes the following:

• Learnability is a well-known criterion for the ad- 
equacy of grammatical theories. Our model pro- 
vides an evolutionary criterion: By comparing the 
trajectories of dynamical linguistic systems to his- 
torically observed trajectories, one can determine 
the adequacy of linguistic theories or learning al- 
gorithms. 

• We derive explicit dynamical systems corresponding
to parametrized linguistic theories (e.g., the Head
First/Final parameter in HPSG or GB grammars)
and memoryless language learning algorithms (e.g.,
gradient ascent in parameter space).

• We illustrate the use of dynamical systems as a 
research tool by considering the loss of Verb Sec- 
ond position in Old French as compared to Mod- 
ern French. We demonstrate by computer model- 
ing that one grammatical parameterization in the 
literature does not seem to permit this historical 
change, while another does. We can more accu- 
rately model the time course of language change. In 
particular, in contrast to Kroch (1989) and others, 
who mimic population biology models by imposing 
an S-shaped logistic change by assumption, we ex- 
plain the time course of language change, and show 
that it need not be S-shaped. Rather, language- 
change envelopes are derivable from more funda- 
mental properties of dynamical systems; sometimes 
they are S-shaped, but they can also be nonmono- 
tonic. 

• We examine by simulation and traditional phase-space
plots the form and stability of possible "diachronic
envelopes" given varying conditions of alternative
language distributions, language acquisition
algorithms, parameterizations, input noise,
and sentence distributions.

2 The Acquisition-Based Model of 
Language Change 

We first show how a combination of a grammatical theory
and a learning paradigm leads directly to a formal
dynamical systems model of language change.

Figure 1: Time evolution of grammars using a greedy
learning algorithm. The x-axis is generation time, e.g.,
units of 20-30 years. The y-axis is the percentage of
the population speaking the languages as indicated on
the curves, e.g., S(ubject) V(erb) O(bject) with no Verb
Second = SVO−V2.

First, informally, consider an adult population speaking
a particular language.² Individual children attempt
to attain their caretakers' target grammar. After a finite
number of examples, some are successful, but others
may misconverge. The next generation will therefore
no longer be linguistically homogeneous. The third generation
of children will hear sentences produced by the
second (a different distribution) and they, in turn, will
attain a different set of grammars. Over generations, the
linguistic composition evolves as a dynamical system. In
the remainder of this paper we formalize this intuition,
obtaining detailed figures like the one in fig. 1, showing the
evolution of language types over successive generations
within a single community. We return to the details
later, but let us first formalize our intuitions.

Grammatical theory, Learning Algorithm, 
Sentence Distributions 

1. Denote by $Q$ a family of possible (target) grammars.
Each grammar $g \in Q$ defines a language $L(g) \subseteq \Sigma^*$
over some alphabet $\Sigma$ in the usual way.

2. Denote by $P$ the distribution with which sentences
of $\Sigma^*$ are presented to the individual learner (child).
More specifically, let $P_i$ be the distribution with which
sentences of the $i$th grammar $g_i \in Q$ are presented if
there is a speaker of $g_i$ in the adult population. Thus,
if the adult population is linguistically homogeneous
(with grammar $g_1$) then $P = P_1$. If the adult population
speaks 50 percent $L(g_1)$ and 50 percent $L(g_2)$ then
$P = \frac{1}{2} P_1 + \frac{1}{2} P_2$.

3. Denote by A the learning algorithm that children 
use to hypothesize a grammar on the basis of input data. 
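
As a minimal sketch of item 2, the following Python snippet mixes per-grammar sentence distributions according to the composition of the adult population. The sentence labels and the two toy distributions are hypothetical and serve only as an illustration; they are not taken from the grammars studied in this paper.

```python
def mix_distributions(population, sentence_dists):
    """Return P(w) = sum_i population[i] * P_i(w) as a dict over sentences."""
    mixed = {}
    for weight, dist in zip(population, sentence_dists):
        for sentence, prob in dist.items():
            mixed[sentence] = mixed.get(sentence, 0.0) + weight * prob
    return mixed

# Hypothetical toy distributions P_1 and P_2 over word-order patterns.
P1 = {"S V O": 0.5, "S V": 0.5}
P2 = {"S V": 0.25, "O V S": 0.75}

# Half the adults speak L(g_1) and half speak L(g_2): P = 1/2 P_1 + 1/2 P_2.
P = mix_distributions([0.5, 0.5], [P1, P2])
```

A sentence generated by both grammars (here "S V") receives probability mass from both groups of speakers, which is why a child in a mixed population sees a distribution that matches neither grammar exactly.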



1 One notable exception is Kroch, 1989, whose account we
explore below.



2 In our framework, this implies that the adult members 
of this population have internalized the same grammar. 



If $d_n$ is a presentation sequence of $n$ randomly drawn
examples, then learnability (Gold, 1967) requires (for
every target grammar $g_t$),

$$\mathrm{Prob}[A(d_n) = g_t] \xrightarrow[n \to \infty]{} 1$$

We now define a dynamical system by providing its 
two necessary components: 

A State Space ($S$): a set of system states. Here, the
state space is the space of possible linguistic compositions
of the population. Each state is described by a
distribution $P_{pop}$ on $Q$ describing the language spoken
by the population.³

An Update Rule: how the system states change from
one time step to the next. Typically, this involves specifying
a function, $f$, that maps $s_t \in S$ to $s_{t+1}$.⁴

In our case the update rule can be derived directly
from learning algorithm $A$ because learning can change
the distribution of languages spoken from one generation
to the next. For example, given $P_{pop,t}$, we see
that any $\omega \in \Sigma^*$ is presented with probability

$$P(\omega) = \sum_i P_i(\omega)\, P_{pop,t}(g_i)$$

The learning algorithm $A$ uses the linguistic data ($n$
examples, indicated by $d_n$) and conjectures hypotheses
($A(d_n) \in Q$). One can, in principle, compute the
probability⁵ with which the learner will develop an arbitrary
hypothesis, $h_i$, after $n$ examples:



Finite Sample: $\mathrm{Prob}[A(d_n) = h_i] = p_n(h_i)$    (1)

Learnability requires $p_n(g_t)$ to go to 1, for the unique
target grammar, $g_t$, if such a grammar exists. In general,
there is no unique target grammar since we have
nonhomogeneous linguistic populations. However, the
following limiting behavior can still exist:

Limiting Sample: $\lim_{n \to \infty} \mathrm{Prob}[A(d_n) = h_i] = p_i$    (2)

Thus, with probability $p_n(h_i)$,⁶ an arbitrary child will
have internalized grammar $h_i$. Thus, in the next generation,
a proportion $p_n(h_i)$ of the population has grammar
$h_i$, i.e., the linguistic composition of the next generation
is given by $P_{pop,t+1}(h_i) = p_i$ (or $p_n(h_i)$). In this fashion,
we have an update rule,

$$P_{pop,t} \longrightarrow P_{pop,t+1}$$
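
The update rule can be illustrated by simulation. The sketch below estimates the next generation's composition by letting many simulated children each learn from n sentences drawn from the current population. Everything concrete in it is an assumption made for illustration: the two toy grammars, the uniform sentence distributions, and the simple error-driven learner, which stands in for A and is not the algorithm analyzed later in the paper.

```python
import random

# Hypothetical toy setup: two grammars, each identified with its set of sentences.
LANGUAGES = {
    "g1": {"a", "b"},
    "g2": {"b", "c"},
}

def sample_sentence(population, rng):
    """Draw a speaker by population share, then a uniform sentence of their language."""
    grammars = list(population)
    speaker = rng.choices(grammars, weights=[population[g] for g in grammars])[0]
    return rng.choice(sorted(LANGUAGES[speaker]))

def learn(population, n, rng):
    """Memoryless error-driven learner: returns the hypothesis after n examples."""
    hypothesis = rng.choice(sorted(LANGUAGES))
    for _ in range(n):
        sentence = sample_sentence(population, rng)
        if sentence not in LANGUAGES[hypothesis]:
            # Try a randomly chosen grammar; keep it only if it parses the sentence.
            candidate = rng.choice(sorted(LANGUAGES))
            if sentence in LANGUAGES[candidate]:
                hypothesis = candidate
    return hypothesis

def next_generation(population, n, children=10_000, seed=0):
    """Monte Carlo estimate of the share of children ending at each grammar."""
    rng = random.Random(seed)
    counts = dict.fromkeys(LANGUAGES, 0)
    for _ in range(children):
        counts[learn(population, n, rng)] += 1
    return {g: c / children for g, c in counts.items()}

# One step of the update rule from a mixed adult population.
new_mix = next_generation({"g1": 0.8, "g2": 0.2}, n=10)
```

Calling `next_generation` repeatedly on its own output traces a trajectory of the population over generations; the Markov chain results of the next section replace this sampling with an exact computation.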



3 As usual, one needs to be able to define a σ-algebra on the
space of grammars, and so on. This is unproblematic for the
cases considered in this paper because the set of grammars
is finite.

4 In general, this mapping could be fairly complicated. For
example, it could depend on previous states, future states,
and so forth; for reasons of space we do not consider all
possibilities here. For reference, see Strogatz (1993).

5 The finite sample situation is always well defined; see
Niyogi, 1994.

6 Or $p_i$, depending upon whether one wishes to carry out a
finite sample, or a limiting sample analysis for learning within
one generation.



Generality of the approach. Note that such a dynamical
system exists for every choice of $A$, $Q$, and $P_i$
(relative to the constraints mentioned earlier). In short
then,

$$(Q, A, \{P_i\}) \longrightarrow \mathcal{D} \ \text{(dynamical system)}$$

Importantly, this formulation does not assume any particular
linguistic theory, learning algorithm, or distribution
over sentences.

3 Language Change in Parametric 
Systems 

We next instantiate our abstract system by modeling
some specific cases. Suppose we have a "parameterized"
grammatical theory, such as HPSG or GB, with
$n$ boolean-valued parameters and a space $Q$ with $2^n$ different
languages (in this case, equivalently, grammars).
Further, take the assumptions of Berwick and Niyogi
(1994) regarding sentence distributions and learning: $P_i$
is uniform on unembedded sentences generated by $g_i$ and
$A$ is single-step gradient ascent. To derive the relevant
update rule we need the following theorem and corollaries,
given here without proof (see Niyogi, 1994):

Theorem 1 Any memoryless incremental algorithm
that attempts to set the values of the parameters on
the basis of example sentences can be modeled exactly
by a Markov chain. This Markov chain has $2^n$ states,
with each state corresponding to a particular grammar. The
transition probabilities depend upon the distribution $P$
with which sentences occur, and the learning algorithm
$A$ (which is essentially a recursive function from data to
hypotheses).

Corollary 1 The probability that the learner internalizes
hypothesis $h_i$ after $m$ examples (solution to equation 1)
is given by,

$$\mathrm{Prob}[\text{Learner's hypothesis} = h_i \in Q \text{ after } m \text{ examples}] = \left\{ \tfrac{1}{2^n} (1, \ldots, 1)' \, T^m \right\}[i]$$

Similarly, making use of limiting distributions of
Markov chains (see Resnick, 1992) one can obtain the
following:

Corollary 2 The probability that the learner internalizes
hypothesis $h_i$ "in the limit" (solution to equation 2)
is given by

$$\mathrm{Prob}[\text{Learner's hypothesis} = h_i \text{ "in the limit"}] = \left\{ (1, \ldots, 1)' \, (I - T + \mathrm{ONE})^{-1} \right\}[i]$$

where $I$ is the identity matrix and ONE is a $2^n \times 2^n$ matrix with all ones.
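
The two corollaries can be checked numerically on a small chain. The following sketch uses only the Python standard library; the 2-state transition matrix is a hypothetical example, the start vector is assumed to be uniform over hypotheses, and the limiting formula is assumed to apply to a chain with a unique stationary distribution (so that the matrix is invertible).

```python
def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def finite_sample(T, m):
    """Corollary 1: uniform start vector multiplied by T, m times."""
    k = len(T)
    probs = [1.0 / k] * k
    for _ in range(m):
        probs = vec_mat(probs, T)
    return probs

def solve_left(M, b):
    """Solve x M = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(M)
    # x M = b is equivalent to M^T x^T = b^T; build the augmented matrix of M^T.
    aug = [[M[j][i] for j in range(k)] + [b[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]

def limiting_sample(T):
    """Corollary 2: (1,...,1)' (I - T + ONE)^{-1}, computed as a linear solve."""
    k = len(T)
    M = [[(1.0 if i == j else 0.0) - T[i][j] + 1.0 for j in range(k)]
         for i in range(k)]
    return solve_left(M, [1.0] * k)

# Hypothetical 2-state transition matrix (each row sums to 1).
T = [[0.9, 0.1],
     [0.5, 0.5]]

after_ten = finite_sample(T, 10)   # distribution over hypotheses after 10 examples
in_limit = limiting_sample(T)      # distribution "in the limit"
```

For this chain the finite-sample distribution approaches the limiting one as m grows, which is the relationship between equations 1 and 2.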

This yields our required dynamical system for 
parameter-based theories: 

1. Let $\Pi_1$ be the initial population mix. Assume $P_i$'s
as above. Compute $P$ from $\Pi_1$ and the $P_i$'s.

2. Compute $T$ (transition matrix) according to the
theorem.

3. Use the corollaries to the theorem to obtain the
update rule, to get the population mix $\Pi_2$.

4. Repeat for the next generation. 



4 Example 1: A Three Parameter 
System 

Let us consider a specific example to illustrate the derivation
of the previous section: the 3-parameter syntactic
subsystem described in Gibson and Wexler (1994) and
Niyogi and Berwick (1994). Specifically, posit 3 Boolean
parameters, Specifier first/final; Head first/final; Verb
second allowed or not, leading to 8 possible
grammars/languages (English and French, SVO−Verb second;
Bengali and Hindi, SOV−Verb second; German and
Dutch, SOV+Verb second; and so forth). The learning
algorithm is single-step gradient ascent. For the moment,
take $P_i$ to be a uniform distribution on unembedded
sentences in the language. Let us consider some
results we obtain by simulating the resulting dynamical
systems by computer. Our key results are these:

1. All +Verb second populations remain stable over
time. Non-Verb second populations tend to gain Verb
second over time (e.g., English-type languages change to
a more German type), contrary to historically observed
phenomena (loss of Verb second in both French and English)
and linguistic intuition (Lightfoot, 1991). This
evolutionary behavior suggests that either the grammatical
theory or the learning algorithm is incorrect, or
both.

2. Rates of change can vary from gradual S-shaped 
curves (fig. 2) to more sudden changes (fig. 3). 

3. Diachronic envelopes are often logistic, but not al- 
ways. Note that in some alternative models of language 
change, the logistic shape has sometimes been assumed 
as a starting point, see, e.g., Kroch (1982, 1989). However,
Kroch concedes that "unlike in the population biology
case, no mechanism of change has been proposed
from which the logistic form can be deduced." On the
contrary, we propose that the logistic form is derivative,
in that it sometimes arises from more fundamental as- 
sumptions about the grammatical theory, acquisition al- 
gorithm, and sentence distributions. Sometimes a logis- 
tic form is not even observed, as in fig. 3. 

4. In many cases the homogeneous population splits 
into stable linguistic groups. 

A variant of the learning algorithm (non-single step, 
gradient ascent) yields figure 1 shown at the beginning of 
this paper. Here again, populations tend to gain Verb- 
Second over time. 

Next, see fig. 4 for the effect of maturation time on 
evolutionary trajectories. 

Finally, so far we have assumed that the $P_i$'s were uniform.
Fig. 5 shows the evolution of the V(erb) O(bject) S(ubject)
+Verb second speakers as the perturbation parameter p varies.

4.1 Nonhomogeneous Populations 

Note that instead of starting with homogeneous popu- 
lations, one could consider any nonhomogeneous initial 
condition, e.g. a mixture of English and German speak- 
ers. Each such initial condition results in a grammatical 
trajectory as shown in fig. 6. One typically characterizes 
dynamical systems by their phase-space plots. These 
contain all the trajectories corresponding to different ini- 
tial conditions, exhibited in fig. 7. 

Figure 2: Percentage of the population speaking languages
of the basic forms V(erb) O(bject) S(ubject) with
and without Verb second. The evolution has been shown
up to 20 generations, as the proportions do not vary
significantly thereafter. Notice the "S" shaped nature of
the curve (Kroch, 1989, imposes such a shape by fiat using
models from population biology, while we derive this
form as an emergent property of our dynamical model,
given varying starting conditions). Also notice the region
of maximum change as the Verb second parameter
is slowly set by an increasing proportion of the population,
with no external influence.



Finally, the following theorem characterizes stable 
nonhomogeneous populations: 

Theorem 2 (Finite Case) A fixed point (stable point)
of the grammatical dynamical system (obtained by a
memoryless learner operating on the 3-parameter space
with $k$ examples to choose its mature hypothesis) is a
solution of the following equation:

$$\Pi' = (\pi_1, \ldots, \pi_8) = \tfrac{1}{8} (1, \ldots, 1)' \left( \sum_{i=1}^{8} \pi_i T_i \right)^{k}$$

If the learner were given infinite time to choose its hypothesis,
then the fixed point is given by

$$\Pi' = (\pi_1, \ldots, \pi_8) = (1, \ldots, 1)' \left( I - \sum_{i=1}^{8} \pi_i T_i + \mathrm{ONE} \right)^{-1}$$

where ONE is the 8x8 matrix with all its entries equal
to 1.

Proof (Sketch): Both equations are obtained simply
by setting $\Pi(t + 1) = \Pi(t)$. ∎
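
The proof sketch can be spelled out in a few lines. The derivation below is a sketch under two assumptions that the text does not state explicitly: $T_i$ is read as the learner's transition matrix when all input comes from the $i$th grammar, and each child starts from a uniformly random hypothesis, which contributes the factor $\tfrac{1}{8}$ in the finite case.

```latex
% Update rule for the population mix, with T(t) the chain induced by Pi(t):
\begin{align*}
T(t) &= \sum_{i=1}^{8} \pi_i(t)\, T_i \\
\Pi(t+1)' &= \tfrac{1}{8}(1,\ldots,1)'\, T(t)^{k}
  && \text{(finite case, $k$ examples)} \\
\Pi(t+1)' &= (1,\ldots,1)'\,\bigl(I - T(t) + \mathrm{ONE}\bigr)^{-1}
  && \text{(limiting case)}
\end{align*}
% Setting \Pi(t+1) = \Pi(t) = \Pi in either line gives the fixed-point equation.
% Check of the limiting form: if \Pi' T = \Pi' and \sum_i \pi_i = 1, then
\[
\Pi'\bigl(I - T + \mathrm{ONE}\bigr)
  = \Pi' - \Pi' T + \Pi'\,\mathrm{ONE}
  = 0 + (1,\ldots,1)' ,
\]
% so \Pi' = (1,\ldots,1)'(I - T + \mathrm{ONE})^{-1} whenever the inverse exists.
```

Note that $T$ itself depends on $\Pi$ through the weights $\pi_i$, so both fixed-point equations are nonlinear in $\Pi$.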

Remark: Strogatz (1993) suggests that higher dimen- 
sional nonlinear mappings are likely to be chaotic. Since 
our systems fall into such a class, this possible chaotic 
behavior needs to be investigated further; we leave this 
for future publications. 

5 The Case of Modern French 

We briefly consider a different parametric system (stud- 
ied by Clark and Roberts, 1993) as a test of our model's 

Figure 5: The evolution of V(erb) O(bject) S(ubject)
+Verb second speakers in a community given different
sentence distributions, $P_i$'s. The $P_i$'s were perturbed
(with parameter p denoting the extent of the perturbation)
around a uniform distribution. The algorithm
used was single-step, gradient ascent. The initial population
was homogeneous, with all members speaking a
V(erb) O(bject) S(ubject) −Verb second type language.
Curves for p = 0.05, 0.75, and 0.95 have been plotted as
solid lines. If we wanted the population to completely
lose the Verb second parameter, the optimal choice of p
is 0.75 (not 1 as expected).

Figure 7: Subspace of a Phase-space plot. The
plot shows the number of speakers of V(erb) O(bject)
S(ubject) (−Verb second and +Verb second) as t varies.
The learning algorithm was single step, gradient ascent.
The different curves correspond to grammatical trajectories
for different initial conditions.

Figure 6: Subspace of a Phase-space plot. The
plot shows the number of speakers of V(erb) O(bject)
S(ubject) (−Verb second and +Verb second) as t varies.
The learning algorithm was single step, gradient ascent.

Figure 8: Evolution of speakers of different languages
in a population starting off with speakers only of Old
French. The "p" settings may be ignored here.



Given this new initial condition, fig. 9 shows the pro- 
portion of speakers losing Verb second after one gener- 
ation as a function of the proportion of sentences from 
the "foreign" Modern French source. Surprisingly small 
proportions of Modern French cause a disproportionate 
number of speakers to lose Verb second, corresponding 
closely to the historically observed rapid change. 

6 Conclusions 

A learning theory (paradigm) attempts to account for 
how children (the individual child) solve the problem 
of language acquisition. By considering a population of 
such individual "child" learners, we arrive at a model of 
emergent, global, population language behavior. Conse- 
quently, whenever a linguist proposes a new grammat- 
ical or learning theory, they are also implicitly propos- 
ing a particular theory of language change, one whose 
consequences need to be examined. In particular, we 
saw the gain of Verb second in the 3-parameter case 
did not match historically observed patterns, but the 
5-parameter system did. In this way the dynamical systems
model supports the 5-parameter linguistic system
as an explanation of some changes in French. We have also greatly
sharpened the informal notions of the time course of lin- 
guistic change, and grammatical stability. Such evolu- 
tionary systems are, we believe, useful for testing gram- 
matical theories and explicitly modeling historical lan- 
guage change. 